by Cory Lanker, LLNL statistician
June 26, 2017, 10:00-11:30 in B543 R1001
This course is a four-class introduction to predictive modeling. It is for all skill levels and uses the R statistics software; because the course is presented as R notebooks, all of the code is included here. The first three classes are lectures that discuss (Mon) concepts useful for prediction, (Tues) regression models, and (Wed) classification models. The last class (Thurs) is a prediction contest in which teams use this material in a competition for the best prediction values.
In this class, I'll discuss concepts that are important for effective prediction models. Prediction is a data analysis task that asks what a response variable is for an observation with known predictor values. We'll skip other data analysis tasks that explore the relationship between the response variable and its predictor variables.
This class consists of four R notebooks in Jupyter. Installation help:
In this class:
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-733567
This course is structured around Max Kuhn and Kjell Johnson's master's-level
text on prediction modeling. The code is either from the book, especially when
using their wonderful caret R package, or it's code I wrote. Their code is
disseminated under their copyright. All data sets in this course are publicly available.
# http://www.R-project.org
# Copyright (C) 1995-2012 The R Core Team
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# A copy of the GNU General Public License is available at
# http://www.r-project.org/Licenses/
# functions for this page
plotsize <- function(wd,ht){options(repr.plot.width=wd, repr.plot.height=ht)}
packagelist <- c("MASS", "FNN", "ellipse", "AppliedPredictiveModeling", "Hmisc",
"missForest", "caret", "corrplot", "earth", "lattice", "e1071")
output <- sapply(packagelist, require, character.only = TRUE)
if (any(!output)){
cat('Installing packages: ', packagelist[!output],' ...\n')
install.packages(packagelist[!output])
output <- sapply(packagelist[!output], require, character.only = TRUE)
}
# sessionInfo()
Prediction is a data-centered task that I'll define:
# A toy example of nine observations with two predictors (n=9, p=2)
x1 = 1:9 * 2
set.seed(0) # if rerun, you get same results due to set.seed(0) setting RNG
x2 = round(rnorm(9,10,5),1)
y = 5 + 0.5 * x1 + round(rnorm(9),1)
example1 = data.frame(y,x1,x2)
for (i in 1:9) rownames(example1)[i] = rawToChar(as.raw(64+i))
# create data to be predicted
xt1 = -1:10 * 2 + 1
set.seed(1) # if rerun, you get same results due to set.seed(1) setting RNG
xt2 = round(rnorm(length(xt1),10,5),1)
truth = 5 + 0.5 * xt1 + round(rnorm(length(xt1)),1) # in practice these true values are unknown
example1t = data.frame(y=NA,x1=xt1,x2=xt2) # our data has nothing for the response y
for (i in 1:length(xt1)) rownames(example1t)[i] = rawToChar(as.raw(73+i))
cat('Our data set, observations with y known:')
t(example1)
cat('Observations to be predicted:')
t(example1t)
plotsize(8,4)
par(mfrow=c(1,2))
plot(example1$x1, example1$y, pch=rownames(example1), xlim=c(0,20),
xlab='predictor 1, x1', ylab='response, y', main="plot of y vs. x1")
plot(example1$x2, example1$y, pch=rownames(example1), xlim=c(0,20),
xlab='predictor 2, x2', ylab='response, y', main="plot of y vs. x2")
Let's make a function $f(x_1,x_2)$ to predict $y$ that is a linear fit of $x_1$ and $x_2$. Then we will use the fit to predict $y$ for observations J through U.
lm.example1 = lm(y~., data=example1)
summary(lm.example1)
Is our linear fit $\hat{y} \equiv f(\mathbf x) = 6.98 + 0.36\, x_1 - 0.06\, x_2$ close enough to the true way $y$ is generated from $x_1$ and $x_2$, which is $y = 5 + 0.5\, x_1$?
Note that besides the formula $5 + 0.5\, x_1$,
the truth includes random noise, so it isn't possible to predict the true values $y$ exactly.
The best we can hope to do is predict the expected value of $y$ given the predictors $X$ (that is, without
the additive noise). We'll see later that the error of our prediction values consists of an
irreducible error term $\sigma^2$ (the variance of the noise added to $y$ values) that is a penalty we accept we cannot overcome.
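As a quick illustration of this noise floor (my own sketch, not from the text): with additive $N(0,1)$ noise, even predicting with the true mean function leaves an average absolute error of $\sqrt{2/\pi} \approx 0.80$.

```r
# Even the true mean function f(x) = 5 + 0.5*x1 cannot beat the noise:
# with y = f(x) + e and e ~ N(0,1), the best achievable mean absolute
# error is E|e| = sqrt(2/pi), about 0.80.
set.seed(2)
e <- rnorm(1e5)
mean(abs(e))   # simulated noise floor, close to...
sqrt(2/pi)     # ...the theoretical value, ~0.798
```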
yhat = predict(lm.example1, example1t)
plotsize(8,4)
par(mfrow=c(1,2))
plot(example1$x1, example1$y, pch=rownames(example1), xlim=c(-2,22), ylim=c(3,15),
xlab='predictor 1, x1', ylab='response, y', main="plot of y vs. x1")
points(example1t$x1, yhat, pch=rownames(example1t), col='red', cex=1)
points(example1t$x1, truth, pch=rownames(example1t), col='orange', cex=.9)
points(example1t$x1, truth, col='orange', cex=2)
plot(example1$x2, example1$y, pch=rownames(example1), xlim=c(-2,22), ylim=c(3,15),
xlab='predictor 2, x2', ylab='response, y', main="plot of y vs. x2")
points(example1t$x2, yhat, pch=rownames(example1t), col='red', cex=1)
points(example1t$x2, truth, pch=rownames(example1t), col='orange', cex=.9)
points(example1t$x2, truth, col='orange', cex=2)
Comments:
cat(sprintf("The average absolute error for the linear fit is %.2f.\n",
mean(abs(yhat - truth))))
The trouble with prediction is that we don't know our average absolute error, because we don't know the true values $y^*$ for observations J through U. Without that knowledge, it is easy to read too much into the data and build fancier models... too fancy given the truly simple relationship between $\mathbf{x}$ and $y$.
Ignoring $x_2$, doesn't the data beg for a polynomial fit of $x_1$ to account for the apparent cubic relationship?
# rather than have two data sets (cumbersome), have a single concatenated data set
# where the first nine rows are the training data
if (nrow(example1) == 9){
example1 = rbind(example1, example1t)
example1$x12 = example1$x1 ^ 2
example1$x13 = example1$x1 ^ 3
}
cubic.example1 = lm(y ~ x1 + x12 + x13, data=example1[1:9,])
head(example1,3)
summary(cubic.example1)
The cubic fit is $\hat{y} = 10.89 - 1.56\, x_1 + 0.198\, x_1^2 - 0.00580\, x_1^3$, yielding predicted values:
Subtracting the mean may reduce variability in the coefficient estimates and increase their $t$ values; in this case that holds for the linear coefficient.
A demeaned data matrix yields the following cubic fit (the $t$ values change for the linear and squared coefficients); note, though, that the fitted curve is identical:
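The cell that produced the summary shown next isn't included above; here is a self-contained sketch that reproduces the demeaned fit by rebuilding the nine training observations with the same seed recipe (variable names `x1c`, `cubic.centered` are my own):

```r
# Sketch: rebuild the nine training observations, demean x1, and refit
# the cubic on the centered polynomial columns.
x1 <- 1:9 * 2
set.seed(0)
x2 <- round(rnorm(9, 10, 5), 1)          # same recipe as the data above
y  <- 5 + 0.5 * x1 + round(rnorm(9), 1)
x1c <- x1 - mean(x1)                     # demean; the training mean is 10
x12 <- x1c^2; x13 <- x1c^3
cubic.centered <- lm(y ~ x1c + x12 + x13)
summary(cubic.centered)
```

The fitted curve is exactly the same as the uncentered cubic fit; only the coefficient parameterization (and hence the $t$ values) changes.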
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.323810 0.187636 49.691 6.24e-08 ***
x1 0.664478 0.062771 10.586 0.00013 ***
x12 0.024107 0.005289 4.558 0.00607 **
x13 -0.005798 0.001229 -4.717 0.00526 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3713 on 5 degrees of freedom
(12 observations deleted due to missingness)
Multiple R-squared: 0.9841, Adjusted R-squared: 0.9745
F-statistic: 103 on 3 and 5 DF, p-value: 6.482e-05
istest = is.na(example1$y)
yhat2 = predict(cubic.example1, example1[istest,])
plotsize(8,4)
par(mfrow=c(1,2))
plot(example1$x1[!istest], example1$y[!istest], pch=rownames(example1)[!istest],
xlim=c(-2,22), ylim=c(3,15), xlab='predictor 1, x1', ylab='response, y',
main="plot of fits vs. x1")
points(example1$x1[istest], yhat2, pch=rownames(example1)[istest],
col='red', cex=1)
points(example1$x1[istest], truth, pch=rownames(example1)[istest],
col='orange', cex=.9)
points(example1$x1[istest], truth, col='orange', cex=2)
xs = -20:220/10
ys = cbind(1, xs, xs^2, xs^3) %*% cubic.example1$coef
plot(example1$x1[!istest], example1$y[!istest], pch=16, cex=0.75, col='black',
xlim=c(-2,22), ylim=c(3,15), xlab='predictor 1, x1', ylab='response, y',
main=sprintf("average absolute error = %1.2f", mean(abs(yhat2-truth))))
lines(xs, ys, col='red')
for (j in which(istest)){lines(rep(example1$x1[j],2), c(yhat2[j-9],truth[j-9]),
col='blue', lwd=3)}
points(example1$x1[istest], truth, pch=rownames(example1)[istest],
col='orange', cex=.9)
points(example1$x1[istest], truth, col='orange', cex=2)
By making the model more complex, the average absolute error increased from 1.24 to 2.10, an increase of 70%. This increased error for the complex model may be counterintuitive.
It is precisely this ignorance of the true values that makes selecting the most appropriate prediction model difficult! In this course I'll present strategies for effective prediction.
In contrast to a linear fit of the data, now we'll perform nearest neighbor regression on the above data set. Whereas a predicted value is affected by all data points in a linear fit, the nearest neighbor prediction uses only the closest observation. Specifically, let $\hat{y}$ be the value of $y_i$ for which $\mathrm{distance}(\mathbf{x}_i, \mathbf{x}^*)$ is smallest over $i=1,\dots,n$.
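To make that definition concrete, here is a minimal hand-rolled version of 1-nearest-neighbor prediction (a sketch; `nn1_predict` is my own helper name, and the notebook itself uses FNN's knn.reg):

```r
# For each test point, find the training row at smallest Euclidean
# distance in predictor space and copy that row's response.
nn1_predict <- function(train_x, train_y, test_x) {
  apply(test_x, 1, function(xstar) {
    d <- sqrt(rowSums(sweep(train_x, 2, xstar)^2))
    train_y[which.min(d)]
  })
}
# e.g., nn1_predict(as.matrix(example1[1:9, 2:3]), example1$y[1:9],
#                   as.matrix(example1t[, 2:3]))
```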
example1.1nn <- knn.reg(train=example1[1:9,2:3], test=example1[-(1:9),2:3],
y=example1$y[1:9], k=1)
str(example1.1nn)
istest = is.na(example1$y)
yhat3 = example1.1nn$pred
plotsize(8,4)
par(mfrow=c(1,2))
plot(example1$x1, example1$y, pch=rownames(example1), xlim=c(-2,22),
ylim=c(3,15), xlab='predictor 1, x1', ylab='response, y', cex=0.75,
main="plot of y vs. x1\nnn(red), linear(blue)")
points(example1t$x1, yhat3, pch=rownames(example1t), col='red', cex=1)
points(example1t$x1, truth, pch=rownames(example1t), col='orange', cex=.9)
points(example1t$x1, truth, col='orange', cex=2)
points(example1t$x1, yhat, pch=rownames(example1t), col='blue', cex=0.75)
for (j in which(istest)){lines(rep(example1$x1[j]-.2,2),
c(yhat3[j-9],truth[j-9]), col='red', lwd=4)}
for (j in which(istest)){lines(rep(example1$x1[j]+.2,2),
c(yhat[j-9],truth[j-9]), col='blue', lwd=4)}
plot(example1$x2, example1$y, pch=rownames(example1), xlim=c(-2,22), ylim=c(3,15),
xlab='predictor 2, x2', ylab='response, y', main="plot of y vs. x2")
points(example1t$x2, yhat3, pch=rownames(example1t), col='red', cex=1) # nearest neighbor predictions, as in the left panel
points(example1t$x2, truth, pch=rownames(example1t), col='orange', cex=.9)
points(example1t$x2, truth, col='orange', cex=2)
cat(sprintf("average absolute error for nearest neighbor = %1.2f",
mean(abs(yhat3-truth))))
It's hard to compare performance of these prediction models given this setup, but we'll learn strategies to test which model is likely to generate better prediction values.
No prediction model will outperform another model on every possible data set
Therefore, an analyst must consider a broad set of models for a given prediction task, like finding the right tool in your prediction modeling toolbox
plotsize(6,4)
plot(datasets::cars, main='Stopping distance vs. vehicle speed')
t(datasets::cars)
xpred = 0:3000/100
speed.lm = lm(dist ~ speed, data=datasets::cars)
ypred.lm = predict(speed.lm, data.frame(speed=xpred))
summary(speed.lm)
# average brute (low value) and cover_tree (high value) --
# a quirk of the algorithm for duplicate x values
speed.1nn = knn.reg(train=datasets::cars[,1], test=as.matrix(xpred),
y=datasets::cars[,2], k=1, algorithm = "cover_tree")$pred
speed.1nn = 0.5*(speed.1nn +
knn.reg(train=datasets::cars[,1], test=as.matrix(xpred),
y=datasets::cars[,2], k=1, algorithm = "brute")$pred)
plotsize(8,4)
plot(datasets::cars, main='Stopping distance vs. vehicle speed',
xlim=c(1,29), ylim=c(-25,125))
lines(xpred, ypred.lm, col='blue')
lines(xpred, speed.1nn, col='red')
abline(h=0, lty=2)
Three tasks for data analysis are
All involve this conceptual representation:
$\; \; y \leftarrow \big[$ nature $\big] \leftarrow x$
We conceptualize our data set consisting of $n$ observations of $\mathbf{x}_i = \left[x_{i1}, x_{i2}, \dots, x_{ip}\right]$ each having a measured response $y_i$:
$\; \; y_i \leftarrow \big[$ nature $\big] \leftarrow \mathbf{x}_i , \; i=1,\dots,n$
Our task at hand is to predict what the unknown response $y^*_i$ is for an observed $\mathbf{x}^*_i = \left[x_{i1}^*, x_{i2}^*, \dots, x_{ip}^*\right]$ for $m$ observations $i=n+1,\dots,n+m$ :
$\; \; y_i^* \leftarrow \big[$ nature $\big] \leftarrow \mathbf{x}_i^* , \; i=n+1,\dots,n+m $
An approach to the task of prediction is by characterizing the structure of $\mathbf{x}$ and its relationship to $y$:
Some considerations:
how many data are there in the span of $\mathbf{X}$?
do the observations to be predicted $\mathbf{X}^*$ span the same space as $\mathbf{X}$?
how are the observations $\mathbf{x}_i$ related to one another?
is it reasonable to assume that the black box is the same for all observations $\mathbf{x}_i$?
The task of prediction may seem as simple as (1) choosing a model, (2) plugging in data, and (3) generating prediction values for new data --- but rarely is it that straightforward.
Here are three factors that complicate prediction.
if (!exists("runnumber")) runnumber = 1
# run first time with n=10, then n=40
# set 'expand' to TRUE to double the density of observations
if (runnumber == 1) expand = FALSE else expand = TRUE
runnumber = 3 - runnumber
set.seed(0)
x <- runif(10)
y <- runif(10)
z <- 0.5 + rt(10,3)/5
data3d <- data.frame(x=x,y=y,z=z)
if (expand == TRUE){
x <- runif(30)
y <- runif(30)
z <- 0.5 + rt(30,3)/5
datanew <- data.frame(x=x,y=y,z=z)
data3d <- rbind(data3d, datanew)
}
plotsize(4,4)
cz <- (data3d$z - min(0,min(data3d$z)))/max(1,diff(range(data3d$z)))
cloud(z~x+y,data=data3d,col=rgb(0.1+0.8*cz,0.25+0.5*cz,0.9-0.8*cz), cex=1,
pch=16, xlim=c(0,1), ylim=c(0,1), zlim=c(0,1))
pred <- data.frame(x=rep(1:19/20, each=19),y=rep(1:19/20,19))
data3d.lm <- lm(z ~ ., data=data3d)
pred.fit <- predict(data3d.lm, pred)
cz <- (pred.fit - 0.1)/0.8
cloud(pred.fit ~ pred$x + pred$y, col=rgb(0.1+0.8*cz,0.25+0.5*cz,0.9-0.8*cz),
cex=0.5, xlim=c(0,1), ylim=c(0,1), zlim=c(-.2,1.2))
data3d[1:10,]
cor(data3d[1:10,])
cor(data3d)
With only 10 points in this cube, it's hard to avoid seeing a relationship between $z$ and the predictors $x$ and $y$, even though none exists. As we gather more observations, though, the absence of any relationship becomes easier to see.
n=160 # Now with 4x the density of the n=40 case
set.seed(0)
x <- runif(n)
y <- runif(n)
#z <- 1+0.5*x+0.5*y+rnorm(n)
z <- 0.5 + rnorm(n)/4
data3d <- data.frame(x=x,y=y,z=z)
plotsize(4,4)
cz <- (data3d$z - min(0,min(data3d$z)))/max(1,diff(range(data3d$z)))
cloud(z~x+y,data=data3d,col=rgb(0.1+0.8*cz,0.25+0.5*cz,0.9-0.8*cz), cex=1,
pch=16, xlim=c(0,1), ylim=c(0,1), zlim=c(0,1))
pred <- data.frame(x=rep(1:19/20, each=19),y=rep(1:19/20,19))
data3d.lm <- lm(z ~ ., data=data3d)
pred.fit <- predict(data3d.lm, pred)
cz <- (pred.fit - 0.1)/0.8
cloud(pred.fit ~ pred$x + pred$y, col=rgb(0.1+0.8*cz,0.25+0.5*cz,0.9-0.8*cz),
cex=0.5, xlim=c(0,1), ylim=c(0,1), zlim=c(-.2,1.2))
Lesson: more data coverage over the predictor space allows more complex models
Question 1: Say a predictor $x$ takes values uniformly between 0 and 1. If we want to make a prediction at $x=0.5$ and use all data within 0.1 away (between 0.4 and 0.6), what proportion of the data is available for making our prediction value?
n = 1000
set.seed(0)
x = runif(n)
sum(abs(x - 0.5) < .1)/n
Let's add predictor $u$ also taking values uniformly between 0 and 1. If we want to make a prediction at $(x,u)=(0.5,0.5)$ and use all data within Euclidean distance 0.1 from this point, now what proportion of the data is available?
n = 1000
set.seed(0)
x = matrix(runif(2*n),n,2)
sum(sqrt(rowSums((x - c(0.5,0.5))^2)) < .1)/n
The exact answer is $\pi(0.1)^2 \approx 0.0314$. If we keep adding dimensions, what would this proportion be for a variety of distances from the center of the unit cube?
distlist = c(0.1, 0.25, 0.50, 0.75, 1)
resmat = matrix(NA, 20, length(distlist))
n = 1e6
for (p in 1:nrow(resmat)){
set.seed(p)
x = matrix(runif(p*n),n,p)
euclid_dist <- sqrt(rowSums((x - rep(0.5,p))^2))
for (d in 1:ncol(resmat))
resmat[p,d] = sum(euclid_dist < distlist[d])/n
}
plotsize(7,4)
plot(1:20, resmat[,1], type='l', ylim=c(0,1), ylab='Proportion within distance d',
xlab='number of predictors on (0,1)', las=1, lwd=2)
for (d in 2:ncol(resmat))
lines(1:20, resmat[,d], col=d+(d>4), lwd=2)
legend(c(15,20),c(.3,1),c("d=0.1","d=0.25","d=0.5","d=0.75","d=1"),
lty=1, col=c(1:4,6), lwd=2, cex=1.2, y.intersp=2.5)
title('Data near cube center becomes sparse as dimension increases')
Lesson: as dimension of predictor space $\mathbb{R}^p$ increases, it's harder for the data to adequately cover the space (so overfitting is easy to do)
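These simulated proportions can be checked analytically (my addition; `ball_fraction` is my own name): for $d \le 0.5$ the ball of radius $d$ lies entirely inside the unit cube, so the expected proportion is just the volume of a $p$-dimensional ball.

```r
# Fraction of uniform data within distance d of the cube center,
# valid for d <= 0.5: the volume of a p-ball of radius d,
#   pi^(p/2) * d^p / gamma(p/2 + 1)
ball_fraction <- function(p, d) pi^(p / 2) * d^p / gamma(p / 2 + 1)
ball_fraction(1, 0.1)    # 0.2, matching the one-predictor simulation
ball_fraction(2, 0.1)    # ~0.0314 = pi * 0.1^2
ball_fraction(10, 0.1)   # ~2.6e-10: essentially no neighbors in 10-D
```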
Predicting winter clothing prices or sales may not be too successful if using only summer clothes data.
How accurate will prediction of 2018 MLB player statistics be using only 2017 data? The last five years? The last 50 years? What do you think would yield the best prediction model?
Assessment techniques that we'll learn will not rescue us from this pitfall. Performance validation assesses how well we will predict observations from the same system (the same $\mathbf{X}$ predictor space).
It may be better to think of ways future data could be different and test sensitivity to those differences that aren't in the prediction model.
Lesson: be careful to check that the new data to be predicted is similar to the data used for training the prediction model
################################################################################
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Copyright 2013 Kuhn and Johnson
### Web Page: http://www.appliedpredictivemodeling.com
### Contact: Max Kuhn (mxkuhn@gmail.com)
###
### Chapter 3: Data Pre-Processing
###
### Required packages: AppliedPredictiveModeling, e1071, caret, corrplot
###
### Data used: The (unprocessed) cell segmentation data from the
### AppliedPredictiveModeling package.
###
################################################################################
################################################################################
### Section 3.1 Case Study: Cell Segmentation in High-Content Screening
library(AppliedPredictiveModeling)
data(segmentationOriginal)
## Retain the original training set
segTrain <- subset(segmentationOriginal, Case == "Train")
## Remove the first three columns (identifier columns)
segTrainX <- segTrain[, -(1:3)]
segTrainClass <- segTrain$Class
dim(segTrain)
head(segTrain,10)
# (C) Kuhn and Johnson, 2013
################################################################################
### Section 3.2 Data Transformations for Individual Predictors
## The column VarIntenCh3 measures the standard deviation of the intensity
## of the pixels in the actin filaments
max(segTrainX$VarIntenCh3)/min(segTrainX$VarIntenCh3)
skewness(segTrainX$VarIntenCh3)
set.seed(0)
nv = rnorm(200)
skewness(nv)
skewness(exp(nv))
plotsize(8,3)
par(mfrow=c(1,3))
hist(nv, main='Histogram of Gaussian draws')
hist(exp(nv),12, main='exp(Gaussian) draws')
hist(segTrainX$VarIntenCh3)
skewvals <- sapply(segTrainX, skewness)
plotsize(4,3)
hist(skewvals,12)
a = which.min(skewvals); a
b = which.max(skewvals); b
plotsize(8,4)
par(mfrow=c(1,2))
hist(segTrainX[,a], 20, main=sprintf('%s skew=%.2f', names(a), skewvals[a]))
hist(segTrainX[,b], 20, main=sprintf('%s skew=%.2f', names(b), skewvals[b]))
Some modeling techniques work better if the predictors have no skew (such as a Gaussian distribution). Even more important is that the distribution of the response variable is reasonable for the chosen prediction model.
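The Box-Cox family used by caret's preProcess can be written out directly (a sketch; `boxcox_tr` and `skew` are my own helper names, not caret functions):

```r
# The Box-Cox family: x* = (x^lambda - 1)/lambda for lambda != 0,
# and x* = log(x) for lambda = 0 (defined for positive x only).
boxcox_tr <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3   # simple skewness
set.seed(6)
x <- exp(rnorm(500))           # strongly right-skewed (lognormal)
y <- boxcox_tr(x, 0)           # lambda = 0 exactly undoes the exp
c(before = skew(x), after = skew(y))
```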
# (C) Kuhn and Johnson, 2013
## Use caret's preProcess function to transform for skewness
segPP <- preProcess(segTrainX, method = "BoxCox") # caret::preProcess
## Apply the transformations
segTrainTrans <- predict(segPP, segTrainX)
#str(segPP) # segPP is a collection of transformations (Output is long!)
## Results for a single predictor
segPP$bc$VarIntenCh3
preProcess(data.frame(a=segTrainX$KurtIntenCh1), method="BoxCox")
# (C) Kuhn and Johnson, 2013
plotsize(3,3)
histogram(~segTrainX$VarIntenCh3,
xlab = "Natural Units",
type = "count")
histogram(~log(segTrainX$VarIntenCh3),
xlab = "Log Units",
ylab = " ",
type = "count")
segPP$bc$PerimCh1
histogram(~segTrainX$PerimCh1,
xlab = "Natural Units",
type = "count")
histogram(~segTrainTrans$PerimCh1,
xlab = "Transformed Data",
ylab = " ",
type = "count")
Standardizing predictors centers each variable so that its mean is 0, then scales it to have variance 1.
Standardization is important because some prediction methods require the columns of the predictor matrix to have zero mean and unit variance.
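A minimal sketch of standardization by hand, checked against base R's scale() (the caret preProcess route appears in the next cell):

```r
# Standardize by hand: center to mean 0, then scale to sd 1.
set.seed(5)
v <- rnorm(20, mean = 50, sd = 10)
z <- (v - mean(v)) / sd(v)
all.equal(z, as.vector(scale(v)))   # TRUE: same as base R's scale()
c(mean = mean(z), sd = sd(z))       # ~0 and 1
```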
set.seed(0); n=100
mat = data.frame(x1=rnorm(n,50,10), x2=rexp(n,0.04))
mat$y = 0.5 * mat$x1 + 0.25 * mat$x2 + rnorm(n,0,10)
head(mat)
transf = preProcess(mat, method="center")
#transf = preProcess(mat, method=c("center","scale"))
#transf = preProcess(mat, method=c("BoxCox"))
#transf = preProcess(mat, method=c("BoxCox","center","scale"))
transf
mat2 = predict(transf, mat)
plotsize(8,6)
par(mfrow=c(3,2))
for (i in 1:3){
hist(mat[,i], main=sprintf('Original variable %s', names(mat)[i]))
hist(mat2[,i], main=sprintf('Transformed variable %s', names(mat)[i]))
}
head(segTrainTrans)
rbind(apply(segTrainTrans, 2, mean), apply(segTrainTrans, 2, sd))
# (C) Kuhn and Johnson, 2013
################################################################################
### Section 3.3 Data Transformations for Multiple Predictors
## R's prcomp is used to conduct PCA
pr <- prcomp(~ AvgIntenCh1 + EntropyIntenCh1,
data = segTrainTrans,
scale. = TRUE)
# Note: segTrainTrans is the transformed data
transparentTheme(pchSize = .7, trans = .3)
plotsize(6,4)
xyplot(AvgIntenCh1 ~ EntropyIntenCh1,
data = segTrainTrans,
groups = segTrain$Class,
xlab = "Channel 1 Fiber Width",
ylab = "Intensity Entropy Channel 1",
auto.key = list(columns = 2),
type = c("p", "g"),
main = "Original Data: 2 variable PCA",
aspect = 1)
# (C) Kuhn and Johnson, 2013
xyplot(PC2 ~ PC1,
data = as.data.frame(pr$x),
groups = segTrain$Class,
xlab = "Principal Component #1",
ylab = "Principal Component #2",
main = "Transformed: 2 variable PCA",
xlim = extendrange(pr$x),
ylim = extendrange(pr$x),
type = c("p", "g"),
aspect = 1)
## Apply PCA to the entire set of predictors.
# Note: this is a lot of predictors for PCA
dim(segTrainX)
head(apply(segTrainX, 2, sd))
## There are a few predictors with only a single value, so we remove these first
## (since PCA uses variances, which would be zero)
isZV <- apply(segTrainX, 2, function(x) length(unique(x)) == 1)
segTrainX <- segTrainX[, !isZV]
segPP <- preProcess(segTrainX, c("BoxCox", "center", "scale"))
segTrainTrans <- predict(segPP, segTrainX)
# now the data are centered and scaled
rbind(apply(segTrainTrans, 2, mean), apply(segTrainTrans, 2, sd))
# (C) Kuhn and Johnson, 2013
segPCA <- prcomp(segTrainTrans, center = TRUE, scale. = TRUE)
## Plot a scatterplot matrix of the first three components
transparentTheme(pchSize = .8, trans = .3)
panelRange <- extendrange(segPCA$x[, 1:3])
plotsize(6,6)
splom(as.data.frame(segPCA$x[, 1:3]),
groups = segTrainClass,
type = c("p", "g"),
as.table = TRUE,
auto.key = list(columns = 2),
prepanel.limits = function(x) panelRange)
Among the 116 predictors, it's easy to see how the first three principal components help separate the data into the PS and WS classes.
splom(as.data.frame(segPCA$x[, c(1,4,5)]),
groups = segTrainClass,
type = c("p", "g"),
as.table = TRUE,
auto.key = list(columns = 2),
prepanel.limits = function(x) panelRange)
It's not as easy to see how PC4 and PC5 help differentiate the classes. Note, though, the strong blue regions in the PC1 vs. PC4 and PC1 vs. PC5 panels, meaning PC4 and PC5 do offer some help in discriminating WS from PS.
str(segPCA) # Note that center and scale must be 0 and 1 for PCA
plotsize(6,3)
varexpl = (segPCA$sdev[1:30]^2)/sum(segPCA$sdev^2)
plot(varexpl, ylim=c(0,max(varexpl)),
cex=.75, type='l', ylab='Percent of Total Variance',
main="Scree plot: explained variance percentage")
points(varexpl, pch=16, cex=.75)
abline(h=0, v=5.5, lty=2)
After PC4, no other principal component explains more than four percent of the total variance.
plotsize(5,5)
splom(as.data.frame(segPCA$x[, c(1,6)]),
groups = segTrainClass,
type = c("p", "g"),
as.table = TRUE,
auto.key = list(columns = 2),
prepanel.limits = function(x) panelRange)
isx = (segPCA$x[,1] > 0 & segPCA$x[,1] < 1.5)
table(segTrainClass[isx])
isx2 = (segPCA$x[,1] > 0 & segPCA$x[,1] < 1.5 &
segPCA$x[,6] > 1 & segPCA$x[,6] < 1.5)
table(segTrainClass[isx2])
PC6 is still informative, as shown above. Without PC6, classifying PS and WS for values of PC1 between 0 and 1.5 would have a misclassification score of 37/146 = 25.3%. Using PC6, we improve the misclassification score in this range to 33/146 = 22.6%. The other principal components make small contributions to improving the overall classifier.
# (C) Kuhn and Johnson, 2013
## Format the rotation values for plotting
segRot <- as.data.frame(segPCA$rotation[, 1:3])
## Derive the channel variable
vars <- rownames(segPCA$rotation)
channel <- rep(NA, length(vars))
channel[grepl("Ch1$", vars)] <- "Channel 1"
channel[grepl("Ch2$", vars)] <- "Channel 2"
channel[grepl("Ch3$", vars)] <- "Channel 3"
channel[grepl("Ch4$", vars)] <- "Channel 4"
# to understand the previous code:
head(vars)
head(grepl("Ch1$", vars))
head(segRot)
segRot$Channel <- channel
head(segRot)
complete.cases(segRot)
tail(segRot)
segRot <- segRot[complete.cases(segRot),]
segRot$Channel <- factor(as.character(segRot$Channel))
# (C) Kuhn and Johnson, 2013
## Plot a scatterplot matrix of the first three rotation variables
transparentTheme(pchSize = .8, trans = .7)
panelRange <- extendrange(segRot[, 1:3])
library(ellipse)
upperp <- function(...)
{
args <- list(...)
circ1 <- ellipse(diag(rep(1, 2)), t = .1)
panel.xyplot(circ1[,1], circ1[,2],
type = "l",
lty = trellis.par.get("reference.line")$lty,
col = trellis.par.get("reference.line")$col,
lwd = trellis.par.get("reference.line")$lwd)
circ2 <- ellipse(diag(rep(1, 2)), t = .2)
panel.xyplot(circ2[,1], circ2[,2],
type = "l",
lty = trellis.par.get("reference.line")$lty,
col = trellis.par.get("reference.line")$col,
lwd = trellis.par.get("reference.line")$lwd)
circ3 <- ellipse(diag(rep(1, 2)), t = .3)
panel.xyplot(circ3[,1], circ3[,2],
type = "l",
lty = trellis.par.get("reference.line")$lty,
col = trellis.par.get("reference.line")$col,
lwd = trellis.par.get("reference.line")$lwd)
panel.xyplot(args$x, args$y, groups = args$groups,
subscripts = args$subscripts)
}
splom(~segRot[, 1:3],
groups = segRot$Channel,
lower.panel = function(...){}, upper.panel = upperp,
prepanel.limits = function(x) panelRange,
auto.key = list(columns = 2))
Many methods cannot handle missing data and omit those observations from portions of their calculations. When a data set has many predictors and only one item is missing (NA), it doesn't make sense to ignore all the other information that observation can provide. Therefore we can impute those values using the mean or nearby observations.
The Hmisc function aregImpute allows multivariate imputation. There are several comparable imputation functions for R.
See Alzola CF, Harrell FE (2004): An Introduction to S and the Hmisc and Design Libraries at http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf for extensive documentation and examples for the Hmisc package.
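For contrast, the simplest alternative is single-value mean imputation (a toy sketch with made-up data; it ignores the relationships between variables that multivariate imputation exploits):

```r
# Replace each NA with its column mean -- fast, but it shrinks variance
# and ignores correlations among the predictors.
df <- data.frame(a = c(1, 2, NA, 4), b = c(10, NA, 30, 40))
for (j in seq_along(df)) {
  df[[j]][is.na(df[[j]])] <- mean(df[[j]], na.rm = TRUE)
}
df   # a[3] becomes 7/3, b[2] becomes 80/3
```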
colSums(is.na(segRot))
set.seed(0)
segRot.mis <- prodNA(segRot, noNA=0.1)
is.mis <- which(rowSums(is.na(segRot.mis)) > 0)
head(segRot.mis[is.mis,],10)
rbind(colMeans(segRot.mis[,1:3],na.rm=T), apply(segRot.mis[,1:3],2,sd,na.rm=T))
DImodel <- aregImpute(~ PC1+PC2+PC3, data=segRot.mis)
segRot.impute = segRot.mis
segRot.impute[is.na(segRot.mis[,1]),1] = rowMeans(DImodel$imputed$PC1)
segRot.impute[is.na(segRot.mis[,2]),2] = rowMeans(DImodel$imputed$PC2)
segRot.impute[is.na(segRot.mis[,3]),3] = rowMeans(DImodel$imputed$PC3)
head(segRot.impute[is.mis,],10)
plotsize(8,3)
par(mfrow=c(1,3))
for (j in 1:3){
isx = is.na(segRot.mis[,j])
plot(segRot.impute[isx,j], segRot[isx,j], asp=1)
abline(0,1,col='red',lty=2)
}
DImodel2 <- aregImpute(Channel ~ PC1 + PC2 + PC3, data=segRot.impute)
str(DImodel2$cat.levels)
cbind(DImodel2$imputed$Channel, segRot$Channel[is.na(segRot.mis$Channel)])
Given the limited data set size, the imputation isn't poor. For 7 of 10 data points, the true missing Channel class has at least as many imputed guesses as the other Channel classes...
Correlated variables cause a variety of problems for prediction methods. For example, collinearity (correlation near $\pm 1$) causes linear fit coefficients to be highly variable and degrades the meaning of variable importance in a random forest. The following functions remove correlated variables from the data in preparation for analysis.
# (C) Kuhn and Johnson, 2013
################################################################################
### Section 3.5 Removing Variables
## To filter on correlations, we first get the correlation matrix for the
## predictor set
segCorr <- cor(segTrainTrans)
plotsize(8,8)
corrplot(segCorr, order = "hclust", tl.cex = .35)
# (C) Kuhn and Johnson, 2013
## caret's findCorrelation function is used to identify columns to remove.
highCorr <- findCorrelation(segCorr, .75)
highCorr
segCorr2 <- segCorr[-highCorr,-highCorr]
plotsize(6,6)
corrplot(segCorr2, order = "hclust", tl.cex = .35)
Factor variables (categorical levels) should be turned into indicator variables (1/0 values based on membership).
# (C) Kuhn and Johnson, 2013
################################################################################
### Section 3.8 Computing (Creating Dummy Variables)
data(cars)
type <- c("convertible", "coupe", "hatchback", "sedan", "wagon")
cars$Type <- factor(apply(cars[, 14:18], 1, function(x) type[which(x == 1)]))
head(cars)
# Note that the car's model has already been converted into dummy variables
# Buick, Cadillac, etc.
set.seed(0)
carSubset <- cars[sample(1:nrow(cars), 20), c(1, 2, 19)]
head(carSubset)
levels(carSubset$Type)
simpleMod <- dummyVars(~Mileage + Type,
data = carSubset,
## Remove the variable name from the column name
levelsOnly = TRUE)
simpleMod
head(predict(simpleMod, carSubset))
withInteraction <- dummyVars(~Mileage + Type + Mileage:Type,
data = carSubset,
levelsOnly = TRUE)
withInteraction
predict(withInteraction, head(carSubset))
Adding appropriate interaction terms can better explain how predictors relate to the response.
$y = \beta_0 + \beta_1\,(x_1-\mu_1) + \beta_2\,(x_2-\mu_2)$ may fail to capture how $x_1$ and $x_2$ interact in their effect on $y$.
$y = \beta_0 + \beta_1\,(x_1-\mu_1) + \beta_2\,(x_2-\mu_2) + \beta_{12}\,(x_1-\mu_1)(x_2-\mu_2)$ may be a more appropriate model for data that has an interaction effect.
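A brief simulated illustration of these two models (my own sketch, not from the text) when the data-generating process truly contains an interaction:

```r
# Simulate y with a true x1*x2 interaction, then compare fits.
set.seed(3)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + x2 + 2 * x1 * x2 + rnorm(n)
fit.main <- lm(y ~ x1 + x2)    # main effects only: misses the interaction
fit.int  <- lm(y ~ x1 * x2)    # x1 * x2 expands to x1 + x2 + x1:x2
coef(fit.int)["x1:x2"]         # estimate near the true value 2
anova(fit.main, fit.int)       # the interaction term is highly significant
```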
Measuring the performance of the candidate prediction models is an important part of the modeling process. Some models will fit a sample well but fail to give good predictive values for future observations.
Assume we know all $N$ elements of a population having predictors $\mathbf{x}_i$ and response $y$. A common measure of predictive performance of function $f(\mathbf{x})$ is the mean squared error (MSE) as given by $$MSE(f) = \frac1N \sum_{i=1}^N\left(y_i - f(\mathbf{x}_i)\right)^2.$$ The better our prediction model $f(\mathbf{x}_i)$ approximates $y_i$ given the $\mathbf{x}_i$ in the population we are predicting, the lower the MSE for $f$.
Note that $f(\mathbf{x})$ does not have to approximate $y$ well for every $\mathbf{x}$, only for the part of the predictor space $\mathbb{R}^p$ where the population members $\{\mathbf{x}_i\}$ lie.
Most likely the population is unknown so the previous MSE formula cannot be used. If you have an observed sample of size $n$ from the system you are modeling, then the sample MSE of prediction function $f$ is $$MSE(f) = \frac1n \sum_{i=1}^n\left(y_i - f(\mathbf{x}_i)\right)^2.$$
In general we don't care about the training MSE: there is a multitude of models that can yield close to zero MSE yet would not be appropriate for predicting future observations. (An MSE of zero won't be possible if there are duplicate observations that have different response values.)
However, if we have a set of observations $(\mathbf{x}_i, y_i), i = 1, \dots, n$, that are independent from a different set of observations $(\mathbf{x}_i, y_i), i = n+1, \dots, n+m$, then we can use the first $n$ observations to train the prediction model $f(\mathbf{x})$, as you would train a dog to do tricks. The trick here is finding a model that generates predictive values for new observations $\mathbf{x}$ such that $f(\mathbf{x})$ is close to the true value $y$.
We can assess how the trained function $f(\mathbf{x})$ will perform on the population (or part of the population) we want to predict by using the additional $m$ observations not used in training. Since we know the $y$ for those $m$ values, we can calculate our model's test MSE using $$MSE_{test}(f) = \frac1m \sum_{i=n+1}^{n+m}\left(y_i - f(\mathbf{x}_i)\right)^2.$$
This test MSE is a good estimate of how well our function $f$ will predict values for $\mathbf{x}$ in the population, but only if these $m$ observations are representative of the observations we will encounter when implementing our prediction model.
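A minimal base-R sketch of this train/test split, using simulated data with known noise variance:

```r
# Simulate n training and m test observations from a simple linear system
set.seed(2)
n <- 100
m <- 50
x <- runif(n + m, 0, 10)
y <- 2 + 0.5*x + rnorm(n + m)   # noise variance is 1

train <- data.frame(x = x[1:n],         y = y[1:n])
test  <- data.frame(x = x[(n+1):(n+m)], y = y[(n+1):(n+m)])

f <- lm(y ~ x, data = train)                     # train on the n observations
mse_test <- mean((test$y - predict(f, test))^2)  # assess on the m held-out observations
mse_test
```

Because the model family matches the truth here, the test MSE should land near the irreducible noise variance of 1.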
$$\Big[\; n+m \; \text{observations} \; \Big] \quad \longrightarrow \quad \Big[\; n \; \text{training observations} \; \Big] + \Big[\; m \; \text{testing observations} \; \Big]$$What if the $n$ observations are also representative of the future data we'll predict? Our performance assessment for the model-generated function $f$ would be more accurate if we include the $n$ training observations along with the $m$ observations of the test set.
But if we calculate the training MSE for $f$ equal to $$MSE_{train}(f) = \frac1n \sum_{i=1}^{n}\left(y_i - f(\mathbf{x}_i)\right)^2$$ we do not get a good assessment of how $f$ will predict new data because $f$ learns to predict those $n$ observations well (so $MSE_{train}(f)$ will be low no matter how well $f$ predicts new observations).
What if we flip the roles of $n$ and $m$ above, and use the $m$ observations to train $f$ and then the $n$ observations for testing? $$\Big[\; n+m \; \text{observations} \; \Big] \quad \longrightarrow \quad \Big[\; n \; \text{testing observations} \; \Big] + \Big[\; m \; \text{training observations} \; \Big]$$
When we train our prediction model on the $m$ observations, we will get a different $f^{(m)}$ than the one from training on the $n$ observations $f^{(n)}$. That doesn't matter because we are assessing the prediction model that generates the functions $f(\mathbf{x})$ rather than the specific functions $f$.
Let's do this: using the $n$ observations our prediction model gives function $f^{(n)}(\mathbf{x})$ that we assess with the $m$ observations, and then do this again using the $m$ observations in our prediction model to get function $f^{(m)}(\mathbf{x})$ assessed with the $n$ observations. Our overall assessed MSE is $$MSE(f) = \frac1{n+m} \left( \sum_{i=1}^{n}\left(y_i - f^{(m)}(\mathbf{x}_i)\right)^2 + \sum_{i=n+1}^{n+m}\left(y_i - f^{(n)}(\mathbf{x}_i)\right)^2 \right).$$
This MSE is likely the best assessment so far given that all $n+m$ observations are representative of the data we will have to predict.
The implementation above is flawed in that if $m$ is small relative to $n$, then the function $f^{(m)}$ is not likely to give good results due to its small training set size. Instead we can use equal-sized folds of the data to assess our prediction model $f$. Let's say we divide the data into 3 folds (this example expands easily to $k$ folds). Then, for each fold $j$, we train the model on the other folds to get $f^{(-j)}$ and test it on fold $j$: $$MSE_j(f) = \frac1{n_j} \sum_{i\,\in\,\text{fold}\,j}\left(y_i - f^{(-j)}(\mathbf{x}_i)\right)^2,$$ where $n_j$ is the number of observations in fold $j$.
Then the overall assessment of model $f$ is the average of the three $MSE_j$ values, or more properly the weighted average: $$MSE(f) = \sum_{j=1}^k \frac{n_j}n\, MSE_j(f),$$ with the total number of observations $n$ being equal to ${\sum_{j=1}^k n_j}$.
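A hand-rolled version of this $k$-fold calculation in base R (simulated data), where averaging all the squared errors at once gives exactly the weighted average of the per-fold MSEs since the folds partition the observations:

```r
# k-fold CV estimate of MSE for a linear model on simulated data
set.seed(3)
n <- 90
k <- 3
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 1 + 2*dat$x + rnorm(n)            # noise variance 1

fold <- sample(rep(1:k, length.out = n))   # random fold assignment
sq_err <- numeric(n)
for (j in 1:k) {
  hold <- fold == j
  fj <- lm(y ~ x, data = dat[!hold, ])     # train on the other folds
  sq_err[hold] <- (dat$y[hold] - predict(fj, dat[hold, ]))^2
}
mse_cv <- mean(sq_err)   # equals sum_j (n_j/n) * MSE_j(f)
mse_cv
```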
Using $k=5$ or 10 folds should give an accurate assessment of the prediction model under two important assumptions: the observations are independent, and they are representative of the data we will predict.
$k$-fold assessment works well because every observation is used exactly once for testing, while each training set still contains most (a fraction $\frac{k-1}k$) of the data.
After a prediction model is selected based on its lowest $k$-fold MSE, the final model is trained with all $n$ observations, and that model's $f(\mathbf{x})$ is used for predicting values on new $\mathbf{x}$.
Carrying this idea to the extreme is Leave-One-Out Cross-Validation (LOOCV), using $k=n$. Some linear methods have convenient closed-form calculations for this metric (called generalized cross-validation), so the model doesn't have to be fit $n$ times, each time holding out the observation to be predicted.
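For ordinary least squares, the LOOCV MSE has an exact shortcut using the hat-matrix diagonal $h_{ii}$ (generalized cross-validation replaces each $h_{ii}$ by its average); a base-R check on simulated data:

```r
# Verify the closed-form LOOCV shortcut for lm against brute-force refitting
set.seed(4)
n <- 40
dat <- data.frame(x = runif(n))
dat$y <- 1 + dat$x + rnorm(n, sd = 0.3)
fit <- lm(y ~ x, data = dat)

# Shortcut: LOOCV MSE = mean( (residual_i / (1 - h_ii))^2 ), no refitting needed
loocv_short <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)

# Brute force: refit n times, each time leaving one observation out
loocv_brute <- mean(sapply(1:n, function(i) {
  fi <- lm(y ~ x, data = dat[-i, ])
  (dat$y[i] - predict(fi, dat[i, , drop = FALSE]))^2
}))

all.equal(loocv_short, loocv_brute)
```

The two calculations agree to numerical precision, which is why LOOCV is cheap for linear smoothers even when $n$ is large.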
As you increase the model complexity, you risk "overfitting" your observations: treating random deviations in the training data as indicative of the true relationship between $\mathbf{x}$ and $y$, when the model is really just fitting noise.
But if your model is too simple, then the true relationship between $\mathbf{x}$ and $y$ cannot be captured by the function $f$. We say that model bias exists when the expected value of the fit $E(f)$ does not equal the true mean $\mu = E(y)$. The goal is to predict $\mu(\mathbf{x})$ for all $\mathbf{x}$.
Putting this together in the expected MSE formula: $$E[MSE] = \sigma^2 + (\text{Model Bias})^2 + (\text{Model Variance}).$$
The first term, $\sigma^2$, is due to the noise in the system: the variation that cannot be modeled. Your model can never do better than $\sigma^2$, as this portion is the irreducible error that cannot be predicted. It equals the expected variance of the response about its mean value (the truth, $\mu(\mathbf{x})$): $$ \sigma^2 = E(y - \mu(\mathbf{x}))^2.$$
The second term is the square of the model bias, the amount by which the average fit (over all possible training data sets $\mathcal{T}$) differs from the truth $\mu(\mathbf{x})$: $$ \text{Model Bias} = E_\mathcal{T}\big(f(\mathbf{x})\big) - \mu(\mathbf{x}).$$
The third term, model variance, is the expected contribution to the MSE due to the variation of the fits (considering all possible training data sets) about its mean fit: $$ \text{Model Variance} = E_\mathcal{T}\Big(f(\mathbf{x}) - E_\mathcal{T}\big(f(\mathbf{x})\big)\Big)^2.$$
Calculation of the MSE for $f$ takes place over the appropriate range of $\mathbf{x}$, as the above quantities are defined at a single point $\mathbf{x}$.
Imagine fitting a collection of 10 points located at 1 through 10 with a linear fit or 7th-order polynomial when the truth is $\mu = x$.
Five different training data sets (from the infinite possible training sets) are:
set.seed(0)
train = data.frame(run = rep(1:5,each=10), x = rep(1:10,5))
train$y = train$x + rnorm(nrow(train))
xs = seq(0.5,10.5,,1000)
test = data.frame(run = NA, x = xs, y = NA)
for (j in 2:7){
  train[,j+2] = train$x ^ j
  test[,j+2] = test$x ^ j
}
fit1 = fit7 = matrix(NA, max(train$run), length(xs))
for (r in 1:max(train$run)){
  lm1 = lm(y ~ x, data=train[train$run == r,-1])
  fit1[r,] = predict(lm1, test)
  lm7 = lm(y ~ ., data=train[train$run == r,-1])
  fit7[r,] = predict(lm7, test)
}
plotsize(8,4)
par(mfrow=c(1,3))
plot(train$x, train$y, pch=train$run, col=train$run, xlab='x', ylab='y', las=1,
main='5 training sets')
abline(0,1,col='orange',lwd=2)
plot(train$x, train$y, pch=train$run, col=train$run, xlab='x', ylab='y', las=1,
main='linear fits have\nlow bias and low variance')
for (r in 1:max(train$run)){
  lines(xs,fit1[r,],col=r)
}
abline(0,1,col='orange',lwd=2)
plot(train$x, train$y, pch=train$run, col=train$run, xlab='x', ylab='y', las=1,
main='7th-order polynomial fits have\nlow bias but high variance')
for (r in 1:max(train$run)){
  lines(xs,fit7[r,],col=r)
}
abline(0,1,col='orange',lwd=2)
T = 1000
xs = seq(0.5,10.5,,1000)
results1 = results7 = matrix(0, T, length(xs))
for (trial in 1:T){
  set.seed(trial)
  train = data.frame(x = 1:10)
  train$y = train$x + rnorm(nrow(train))
  test = data.frame(x = xs, y = NA)
  for (j in 2:7){
    train[,j+1] = train$x ^ j
    test[,j+1] = test$x ^ j
  }
  lm1 = lm(y ~ x, data=train)
  results1[trial,] = predict(lm1, test)
  lm7 = lm(y ~ ., data=train)
  results7[trial,] = predict(lm7, test)
}
plotsize(7,4)
par(mfrow=c(1,2))
plot(c(1,10),c(-1,12), type='n', xlab='x', ylab='y', las=1,
main='range of 80% of\nlinear fits')
abline(0,1,col='orange',lwd=2)
lines(xs, colMeans(results1), lty=2)
lines(xs, apply(results1,2,quantile,prob=.9), col='red', lty=2)
lines(xs, apply(results1,2,quantile,prob=.1), col='red', lty=2)
legend('topleft',c('mean','P10,P90'),lty=2,col=c(1,2),cex=1,y.intersp=2.5)
plot(c(1,10),c(-1,12), type='n', xlab='x', ylab='y', las=1,
main='range of 80% of\n7th-order polynomial fits')
abline(0,1,col='orange',lwd=2)
lines(xs, colMeans(results7), lty=2)
lines(xs, apply(results7,2,quantile,prob=.9), col='red', lty=2)
lines(xs, apply(results7,2,quantile,prob=.1), col='red', lty=2)
legend('topleft',c('mean','P10,P90'),lty=2,col=c(1,2),cex=1,y.intersp=2.5)
The fact that the mean fit aligns with the truth means both methods are unbiased, i.e. $E(f) = \mu$.
Let the truth function be $\mu(\mathbf{x}) = 5 (x + x^2 + x^3)$ on the range $(-1, 1)$. We'll fit polynomials from order 1 to order 5 and confirm that a third-order polynomial is best in terms of MSE.
n = 20
J = 250
P = 5
set.seed(0)
x <- runif(n*J, -1, 1)
xte <- seq(-1,1,,1000)
y <- 5*x + 5*x^2 + 5*x^3 + rnorm(n*J)
yte <- 5*xte + 5*xte^2 + 5*xte^3
TrainMSE= rep(0,P)
SqBias = rep(NA,P)
MSE = rep(0,P)
for (p in 1:P){
  train = data.frame(y = y)
  test = data.frame(y = yte)  # test grid has length(yte) rows; don't recycle the training y
  for (i in 1:p){
    train[,i+1] = x^i
    test[,i+1] = xte^i
  }
  overall.lm = lm(y~., data=train)
  yhat = predict(overall.lm, test)
  SqBias[p] = mean((yhat - yte)^2)
  for (j in 1:J){
    obs = 1:n + n*(j-1)
    indiv.lm = lm(y~., data=train[obs,])
    yhat = predict(indiv.lm, test)
    MSE[p] = MSE[p] + mean((yhat - yte)^2) / J
    TrainMSE[p] = TrainMSE[p] + mean(indiv.lm$residuals ^ 2) / J
  }
}
Variance = MSE - SqBias
plotsize(7,4)
par(mfrow=c(1,2))
plot(1:P, SqBias, ylim=c(0,max(MSE)), las=1,
main='MSE components vs.\npolynomial complexity',
ylab='squared error', xlab='model complexity')
points(1:P, Variance, col=2, pch=3)
points(1:P, MSE, col=4, pch=2)
points(1:P, TrainMSE, col=6, pch=4)
legend(2.5,16.5,c('sq.bias','variance','total MSE','train MSE'),
col=c(1,2,4,6),pch=c(1,3,2,4),y.intersp=2.5, cex=.75)
plot(1:P, SqBias, ylim=c(0.7,max(MSE)), log='y', las=1,
main='MSE vs. complexity\non log scale',
ylab='squared error', xlab='model complexity')
points(1:P, Variance, col=2, pch=3)
points(1:P, MSE, col=4, pch=2)
points(1:P, TrainMSE, col=6, pch=4)
Note that training MSE decreases with model complexity, even though polynomials of order 4 or 5 do not fit test data well.
As models grow in complexity through increased flexibility, they become harder to interpret. We will see why LASSO regression (roughly as flexible as least squares) is popular due to its ability to select from a wide range of predictors. Random forests make complex predictions using a large number of predictors but offer techniques to assess the importance of the individual predictors; they would be located just above Bagging in the chart.
We'll learn most of these methods in this course.
From James et al., Introduction to Statistical Learning (2013)
Some prediction models have parameters that determine how they generate fit functions $f$. I think of these parameters as knobs that have to be set before running the model. For example, in $k$-nearest neighbors regression, where the fit function uses the average of the $k$ nearest observations in the training data, the value of $k$ must be set to generate prediction values.
plotsize(6,4)
plot(datasets::cars, main='Stopping distance vs. vehicle speed')
knnFit <- train(dist ~ ., data=datasets::cars, method = "knn", metric = "RMSE",
                tuneGrid = data.frame(k=1:9), trControl = trainControl(method = "cv"))
knnFit
xs = seq(0,30,,1000)
library(FNN)   # provides knn.reg
knn.model <- knn.reg(datasets::cars$speed, test=data.frame(speed=xs),
                     y=datasets::cars$dist, k=5)
str(knn.model)
plot(datasets::cars, main='Predicted stopping distance using\nk-nearest neighbors')
for (k in c(25,10,5)){
  knn.model <- knn.reg(datasets::cars$speed, test=data.frame(speed=xs),
                       y=datasets::cars$dist, k=k)
  # colors match the legend: k=5 blue (4), k=10 red (2), k=25 gray (8)
  lines(xs, knn.model$pred, col=c(`5`=4, `10`=2, `25`=8)[as.character(k)], lwd=4)
}
legend('topleft',c('k=5','k=10','k=25'),col=c(4,2,8), lwd=3, y.intersp=2)
Many methods in the prediction modeling toolbox have parameters that must be set, like the number of neighbors $k$ here. Proper tuning of these parameters is important for getting good predictive values. Without tuning these knobs, you run the risk of overfitting the training data (for example, $k=1$) or underfitting it (such as the gray $k=25$ fit above).
Predicting fuel economy for vehicles using a data set with many predictors (from Chapter 2 of Applied Predictive Modeling)
################################################################################
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Copyright 2013 Kuhn and Johnson
### Web Page: http://www.appliedpredictivemodeling.com
### Contact: Max Kuhn (mxkuhn@gmail.com)
###
### Chapter 2: A Short Tour of the Predictive Modeling Process
###
### Required packages: AppliedPredictiveModeling, earth, caret, lattice
###
### Data used: The FuelEconomy data in the AppliedPredictiveModeling package
###
### Notes:
### 1) This code is provided without warranty.
###
### 2) This code should help the user reproduce the results in the
### text. There will be differences between this code and what is in
### the computing section. For example, the computing sections show
### how the source functions work (e.g. randomForest() or plsr()),
### which were not directly used when creating the book. Also, there may be
### syntax differences that occur over time as packages evolve. These files
### will reflect those changes.
###
### 3) In some cases, the calculations in the book were run in
### parallel. The sub-processes may reset the random number seed.
### Your results may slightly vary.
###
################################################################################
# (C) Kuhn and Johnson, 2013
################################################################################
### Section 2.1 Case Study: Predicting Fuel Economy
library(AppliedPredictiveModeling)   # provides the FuelEconomy data
data(FuelEconomy)
## Format data for plotting against engine displacement
## Sort by engine displacement
cars2010 <- cars2010[order(cars2010$EngDispl),]
cars2011 <- cars2011[order(cars2011$EngDispl),]
## Combine data into one data frame
cars2010a <- cars2010
cars2010a$Year <- "2010 Model Year"
cars2011a <- cars2011
cars2011a$Year <- "2011 Model Year"
plotData <- rbind(cars2010a, cars2011a)
library(lattice)
plotsize(6,4)
xyplot(FE ~ EngDispl|Year, plotData,
xlab = "Engine Displacement",
ylab = "Fuel Efficiency (MPG)",
between = list(x = 1.2))
# (C) Kuhn and Johnson, 2013
## Fit a single linear model and conduct 10-fold CV to estimate the error
set.seed(1)
lm1Fit <- train(FE ~ EngDispl,
data = cars2010,
method = "lm",
trControl = trainControl(method= "cv"))
lm1Fit
# (C) Kuhn and Johnson, 2013
## Fit a quadratic model too
## Create squared terms
cars2010$ED2 <- cars2010$EngDispl^2
cars2011$ED2 <- cars2011$EngDispl^2
set.seed(1)
lm2Fit <- train(FE ~ EngDispl + ED2,
data = cars2010,
method = "lm",
trControl = trainControl(method= "cv"))
lm2Fit
# (C) Kuhn and Johnson, 2013
## Finally a MARS model (via the earth package)
set.seed(1)
marsFit <- train(FE ~ EngDispl,
data = cars2010,
method = "earth",
tuneLength = 15,
trControl = trainControl(method= "cv"))
marsFit
plotsize(5,3)
plot(marsFit)
head(cars2011)
# combine the 2 data sets in case there are different factor levels in them
cars.dummy <- dummyVars(~ Transmission + AirAspirationMethod + DriveDesc + CarlineClassDesc,
                        data=rbind(cars2010,cars2011))
cars.dummy
cars2010D <- predict(cars.dummy, cars2010)
cars2011D <- predict(cars.dummy, cars2011)
head(cars2011D)
quantcol <- sapply(cars2010[1,],is.numeric)
cars2010Q <- cbind(cars2010[,quantcol],cars2010D)
cars2011Q <- cbind(cars2011[,quantcol],cars2011D)
head(cars2010Q)
dim(cars2010Q)
colremove <- which(colnames(cars2010Q) %in% c("ED2", "FE"))
cars.pp <- preProcess(cars2010Q[,-colremove], method=c("BoxCox", "center", "scale"))
X10 <- predict(cars.pp, cars2010Q[,-colremove])
X11 <- predict(cars.pp, cars2011Q[,-colremove])
X10$FE <- cars2010$FE
X11$FE <- cars2011$FE
cars.pp
head(X11)
It turns out the MARS algorithm in caret doesn't need explicit dummy variables.
head(cars2010,3)
## Trying to make a better MARS model (via the earth package)
set.seed(1)
startTime <- proc.time()[3]
# Trying nprune in [3,9] and degree in [1,3], the minimum is nprune=8, degree=2.
# To keep runtimes short, I pare the grid to [5,9] and [1,2].
tuneMatrix <- data.frame(nprune = rep(5:9,2), degree = rep(1:2, each=5))
marsFit2 <- train(FE ~ .,
data = cars2010,
method = "earth",
tuneGrid = tuneMatrix,
trControl = trainControl(method= "cv"))
cat(sprintf("Runtime = %.1f sec", proc.time()[3]-startTime))
# takes about 20 sec to run
marsFit2
## a knn model based on the scaled data
set.seed(1)
startTime <- proc.time()[3]
knnFit <- train(FE ~ .,
data = X10,
method = "knn",
tuneLength = 6,
trControl = trainControl(method= "cv"))
cat(sprintf("Runtime = %.1f sec", proc.time()[3]-startTime))
# takes about 1 sec to run
knnFit
## Fit a linear model on everything
set.seed(1)
lm3Fit <- train(FE ~ .,
data = X10,
method = "lm",
trControl = trainControl(method= "cv"))
lm3Fit
# Removing correlated variables
library(corrplot)   # for the correlation heatmap
X10Corr <- cor(X10)
plotsize(8,8)
corrplot(X10Corr, order = "hclust", tl.cex = .35)
FEcol <- which(colnames(X10) %in% "FE")
highCorr <- findCorrelation(X10Corr[-FEcol,-FEcol], .9)
colRemove <- colnames(X10)[highCorr]
colRemove
X10R <- X10[,-which(colnames(X10) %in% colRemove)]
X11R <- X11[,-which(colnames(X11) %in% colRemove)]
colnames(X10R)
## Fit a linear model on the reduced predictor set
set.seed(1)
lm4Fit <- train(FE ~ .,
data = X10R,
method = "lm",
trControl = trainControl(method= "cv"))
lm4Fit
# (C) Kuhn and Johnson, 2013
## Predict the test set data
cars2011$lm1 <- predict(lm1Fit, cars2011)
cars2011$lm2 <- predict(lm2Fit, cars2011)
cars2011$mars1 <- predict(marsFit, cars2011)
cars2011$mars2 <- predict(marsFit2, cars2011)
cars2011$knn <- predict(knnFit, X11)
## Get test set performance values via caret's postResample function
postResample(pred = cars2011$lm1, obs = cars2011$FE)
postResample(pred = cars2011$lm2, obs = cars2011$FE)
postResample(pred = cars2011$mars1, obs = cars2011$FE)
postResample(pred = cars2011$mars2, obs = cars2011$FE)
postResample(pred = cars2011$knn, obs = cars2011$FE)
cars2011$lm3 <- predict(lm3Fit, X11)
postResample(pred = cars2011$lm3, obs = cars2011$FE)
cars2011$lm4 <- predict(lm4Fit, X11R)
postResample(pred = cars2011$lm4, obs = cars2011$FE)
plotsize(7,4)
plot(cars2011$EngDispl, cars2011$FE, cex=.5)
lines(cars2011$EngDispl, cars2011$lm2, col=2)
points(cars2011$EngDispl, cars2011$lm3, col=4, pch=2, cex=.6)
plotsize(7,6)
plot(cars2011$EngDispl, cars2011$FE, cex=.5)
lines(cars2011$EngDispl, cars2011$lm2, col=2)
for (i in 1:nrow(cars2011)){
  lines(rep(cars2011$EngDispl[i],2),c(cars2011$FE[i],cars2011$lm2[i]), col=2, lwd=2)
  lines(rep(cars2011$EngDispl[i]+runif(1,-.1,.1),2),
        c(cars2011$FE[i],cars2011$lm3[i]), col=4, lwd=2)
}
plotsize(8,8)
par(mfrow=c(2,1))
orderx = order(cars2011$EngDispl)
plot(1:nrow(cars2011), cars2011$FE[orderx], cex=.5, xlab='Observation',
ylab='Fuel Eff. residual',
main='Quadratic fit')
for (i in 1:nrow(cars2011)){
  lines(rep(i,2),c(cars2011$FE[orderx[i]],cars2011$lm2[orderx[i]]), col=2, lwd=2)
}
plot(1:nrow(cars2011), cars2011$FE[orderx], cex=.5, xlab='Observation',
ylab='Fuel Eff. residual',
main='Multivariate linear fit')
for (i in 1:nrow(cars2011)){
  lines(rep(i,2),c(cars2011$FE[orderx[i]],cars2011$lm3[orderx[i]]), col=4, lwd=2)
}
The multivariate linear fit seems to provide a better fit than the quadratic fit in engine displacement, despite warnings that the data matrix $\mathbf{X}$ is rank deficient.